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•'-I Abstract 

Histogram-based empirical Bayes methods developed for analyzing data for large 
numbers of genes, SNPs, or other biological features tend to have large biases when ap- 
plied to data with a smaller number of features such as genes with expression measured 
conventionally, proteins, and metabolites. To analyze such small-scale and medium- 
scale data in an empirical Bayes framework, we introduce corrections of maximum 
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likelihood estimators (MLE) of the local false discovery rate (LFDR). In this context, 
the MLE estimates the LFDR, which is a posterior probability of null hypothesis truth, 
by estimating the prior distribution. The corrections lie in excluding each feature when 
estimating one or more parameters on which the prior depends. An application of the 
new estimators and previous estimators to protein abundance data illustrates how dif- 
ferent estimators lead to very different conclusions about which proteins are affected 
by cancer. 

The estimators are compared using simulated data of two different numbers of fea- 
tures, two different detectability levels, and all possible numbers of affected features. 
The simulations show that some of the corrected MLEs substantially reduce a negative 
bias of the MLE. (The best-performing corrected MLE was derived from the minimum 
description length principle.) However, even the corrected MLEs have strong nega- 
tive biases when the proportion of features that are unaffected is greater than 90%. 
Therefore, since the number of affected features is unknown in the case of real data, we 
recommend an optimally weighted combination of the best of the corrected MLEs with 
a conservative estimator that has weaker parametric assumptions. 

Keywords: empirical Bayes; local false discovery rate; medium-dimensional biology; medium- 
scale inference; minimum description length; penalized likelihood; reduced likelihood; selec- 
tion bias; small-dimensional biology; small-scale inference; Type II maximum likelihood 
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1 Introduction 



1.1 False discovery rates for genomics applications 



In genomics, new technologies facilitate the simultaneous measurement of a wide variety 
of features, up to hundreds of thousands in number. Examples of such biological features 
include genes, locations in the brain, and single-nucleotide polymorphisms (SNPs) in genome- 
wide association studies. A multiple testing problem arises in the analysis of data involving 
N features (Xi,X2, ■ ■ ■ ,Xn) of every individual belonging to two different groups, labeled 
treatment and control for convenience. For the ith feature and a corresponding effect size 
6i, a function T defines the statistic Tj = T (X{) that is used to test the null hypothesis that 
6i = 6 , where 8q is the parameter value corresponding to no effect. For example, a common 
objective in genomics is to discover the genes that are differentially expressed between the 
treatment and control groups of individuals. Thus, gene expression data analysis involves 
testing N null hypotheses of equivalent expression. 

Let Ai denote the variable indicating whether the ith alternative hypothesis is true. In 
the case of a two-sided alternative, Aj = 1 if 0j ^ 8q but Aj = if 0j = 9q. For example, 
Ai = 1 means the ith feature is affected by (or associated with) the treatment, disease, or 
other perturbation. The ith null hypothesis corresponds to a discovery of an effect if the 
statistic Tj falls within some rejection region T, in which case, the ith null hypothesis is 
rejected. A discovery of an effect is a false discovery if there is no effect (Ai = 0); otherwise, 
it is a true discovery (Ai = 1). 



The terminology follows Benjamini and Hochberg (1995), who introduced the false dis 



covery rate (FDR) as an error measure for multiple testing. Many variants of the FDR can 



be found in literature, including the Bayesian FDR (Efron and Tibshirani, 2002) or nonlocal 



FDR (NFDR) QBickelj [Mid) ) and the local FDR (LFDR) ( [Efron et al.[[200l| ). In particular 
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the NFDR is the probability that a null hypothesis is true, conditional on its rejection: 

E (N (TO) 



^(T) = Pr (A 4 = 0\T i eT) 



E(N + (T)) 



where N (T) denotes the number of false discoveries and iV + (T) denotes the total number 
of discoveries ( |Efron , 2010). is used to abbreviate ^pevSrjq, pseudo/false). The LFDR for 



the zth feature is defined as the probability that the null hypothesis is true given the statistic 



ti, the observed realization of Tj = T(Xi) (Efron, 2010). That is, 



ipi = V ({ti}) = Pr (Ai = 0|Tj = ti) , (1) 

which assumes that Tj has a common probability density function gg conditional on the 
null hypothesis that 9^ = 8q and another probability density function g a \ t conditional on the 
alternative hypothesis that 9i ^ 8q. According to Bayes's theorem, 

^ = P(6 l = e \U) = n -^M, (2) 

where ir = P (0, = 9q) is the expectation value of the proportion of null hypotheses that are 
true and g (ti) is the marginal probability density of the test statistic: 

9 (U) = TToflfeo (U) + (1 - 7T ) g-alt (U) ■ (3) 

As 7To and g (ti) are unknown, they are estimated with empirical Bayesian methods to obtain 
the estimated LFDR by making substitutions into equations (|2])-([3j. 

1.2 Motivation and overview 

While high-dimensional biology involves measurements over numerous features, sometimes 
millions in number, small-dimensional biology involves measurements over fewer features. 
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Smaller-scale inference problems arise not only when the total data set represents a small 



number of genes, proteins, metabolites, voxels, or other features (e.g., Seifert et al. 2010), 
but also when there are subsets of a large number of features that have something in common 



that distinguishes them from the other features in the data set. For example, Efron (2008 



§7) estimated the LFDR for each voxel as a member of a reference class of 82 voxels at the 
same physical location. The measurements of the other 15,461 voxels are less relevant to the 
truth of a null hypothesis corresponding to a voxel in the smaller reference class. 

Unfortunately, the statistical methods that have been successfully applied to large-scale 
inference problems are not always directly applicable to inference problems involving con- 
siderably smaller dimensions. In particular, in the estimation of LFDR, commonly used 
methods of estimating the unknown parameters 7r and g (tj) in equations (|2| and ^ in- 



volve the histogram-based estimation of g a \ t (t«) (e.g., Efron, 2004, 2007). While this is 



highly reliable for data sets with several thousand features (Yanofsky and Bickel, 2010 



Montazeri et al. 2010), it has a high bias for data sets with small numbers of features. 



Therefore, special statistical methods are required when the number of features is too large 
for conventional hypothesis testing and yet too small for methods developed for an extremely 
large number of features. Hence, we propose new methods for the estimation of the LFDR 
in small-scale inference problems. 

This paper is organized as follows. First, Section [2] recalls methods of eliminating a 
nuisance parameter by reducing the data vector Xi of the zth feature to a statistic T (xi) 
of smaller dimension. Section [3] reviews certain known LFDR estimators and presents the 
proposed LFDR estimation techniques. The application of the new LFDR estimators to a 
data set with 20 proteins is described in Section 4. The new LFDR estimators are then 
tested and compared using simulated data sets, as described in Section 5. Finally, Section 
[6] concludes the paper with a discussion. Asymptotic results are provided in Appendices A 
and B to explain the information-theoretic background behind one of the new estimators 
and to relate it to maximum likelihood estimation, respectively. 
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2 Data reduction and likelihood 



Let x G X be a vector of measurements of one feature. Note that since only one feature 
is considered in this section, the subscript "i" is not used, except in Example |3j where 
a generalization to N features is shown. The observed data vector x G X is considered a 
realization of the random variable X of probability distribution Pg\ that admits a probability 
density function fg t \ with respect to some dominating measure, where 9 G is the parameter 
of interest and A G A is the nuisance parameter. In the case of discrete X, the density 
function is defined with respect to the counting measure on X. For some known 8q G 0, we 
have 6 = 6q under the null hypothesis or narrow model and 9 ^ 6q under the alternative 
hypothesis or wide model. 

The following two types of likelihood correspond to different ways of reducing a vector x 
to a scalar statistic and of eliminating the nuisance parameter. Which of the two methods 
is appropriate depends on the original parametric family {fe,x '■ & G 0, A G A} and on which 
parameter is of interest. 

2.1 Conditional likelihood 

Consider the functions S and T such that S (X) and T (X) are statistics that together 
contain all the information in X. If S (X) does not depend on 6 and if the probability 
density function gg = fg(»\S(X) = S (x)) of the data conditional on S(x), the realized 
value of that statistic, does not depend on A, then the function £ defined by 



I (6) = g e (T (x)) = f e (T (x) \S (X) = S (x)) 



(4) 



is called the conditional likelihood function given S (x). In analogy with equation (|5]), Severini 



(2000, §8.2.1) has 



fe,x (x) = f e ,\ (S (x) ,T (x)) = g e (T (x)) f e ,\ (S (x)) 
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where fg ^ can denote the probably density function of X, (S (X) , T (X)), or S {X), depend- 
ing on the context. 



Example 1. (Severini, 2000, Example 8.47). Suppose that Xi is binomial (n 1? ^x), X 2 is 



binomial (712,712), and X\ is independent of X2. The parameter of interest is 



9 = Ioe 



7Tl 



1-TTi 



where A is the nuisance parameter 



A = lot 



712 



1 — 7T 2 



Then, 



log L {9, A) = a?i0 + S {x l ,x 2 ) A - ni log (l + e e+A ) - n 2 log (l + e A ) , 



where S(xi,x%) = X\ + X2 = s is sufficient. Then, taking T(xi,X2) = £1, the conditional 
log-likelihood function given S(xi,X2) is 



log* (0) = log S (xx) = Oxi - \ogK{6) 



where 



K{9) 



min(ni ,s) 

£ 

j'=max(0,s— 712) 



«2 



Conditional likelihoods are generally available whenever the parameter of interest is a 



natural parameter of an exponential family (Pawitan, 2001, §10.3). For details, see Severini 



(2000, §8.2.4). A recent application of the conditional likelihood function to genomics data 



can be found in Yang et al. (2011). 
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2.2 Marginal likelihood 

Let T be a measurable function on X . If, for each 9 G 0, the probability density function gg of 
the statistic or reduced data T (X) does not depend on the value of A, then £ (9) = gg (T (x)) 
defines the marginal likelihood function i. 

If, in addition, the conditional distribution of X given T (X) = T (x) does not depend 
on 9, then T (X) is called sufficient for 9. In that case, no information about 9 is lost in 
replacing X with T (X): 

fe,x(x) = ge(T(x))f e , x (x\T(X) = T(x)) 

= g e (T(x))f x (x\T(X)=T(x)) (5) 
= Cg e (T(x)), 

where C is constant in 9. The constant is unimportant because it drops out of likelihood 
ratios: 

fo ltX (x) _ Cg 9l (T(x)) _£(6 1 ) 
fe ,x(x) Cg 0o (T(x)) 1%) 

for any value of A G A. 

Example 2. Suppose x and y are vectors of m and n values that realize the random variables 
X and Y of independent components drawn from normal distributions of unknown means 
£ and i], respectively, and of a common unknown standard deviation a. The parameter of 
interest is the inverse coefficient of variation defined by 9 = (£ — rj) /a with 9 = as the null 
hypothesis and 9 ^ as the alternative hypothesis; the parameter space here is = R 1 . A 
suitable statistic for data reduction is the two-sample t statistic 

a (x, y) x/m 1 + n 1 

where £, 77, and a 2 are the usual unbiased estimators. Then gg (T (x,y)), the probability 



density of T (X, Y) evaluated at the observation (x, y), is the noncentral Student t probability 
density with m + n — 2 degrees of freedom and noncentrality parameter (rrT 1 + n^ 1 ) 9. 

The next example encompasses data of multi-dimensional biology 

Example 3. Example [2] is extended to N genes, proteins, or other biological features such 
that Xi ~ N(£i, £i jm ) and Yi ~ N (77,, Ej in ) correspond to the observed outcome (xi,yi) for 
the ith feature, where i — 1, . . . , N and is the diagonal covariance matrix of determinant 
of fc ; thus, Oi is the standard deviation of independent measurements of feature i. If whether 
or not there is an effect on feature i is much more important than the direction of that effect, 
the parameter of interest for feature % may be 

0% = % -Vi\/<T» ( 7 ) 

the absolute value of the inverse coefficient of variation, with ^ = as the null hypothesis, 
6i > as the alternative hypothesis, and = [0, oo) as the parameter space. Then T (xi, yi) 
is the absolute value of the two-sample t statistic for (x^yj) according to equation (|6]), and 
T (Xi, Y,j) is distributed as the absolute value of a variate from the noncentral Student t distri- 
bution with m+n—2 degrees of freedom and noncentrality parameter 5, = (m _1 + n^ 1 ) Qi. 
Thus, the density gg. (T(xt,yi)) for the ith feature is the probability density of T(Xj,Yj) 
evaluated at (xi,yi). Bickel| ( 2011b[e ) illustrated different methods of penalized maximum 



likelihood estimation of the LFDR under this model. 



Severini (2000, §8.3) and Schweder and Hjort (2002) provide additional examples of 



the marginal likelihood, also called the reduced likelihood and not to be confused with the 
likelihood integrated with respect to a prior distribution. 
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3 Local false discovery rate estimation 

As mentioned in Section [TJ previous estimators of FDR and LFDR are highly biased for a 
moderate or small number of hypotheses. We present several strategies in this section to 
reduce that bias. 

3.1 Previous LFDR estimators 

In this subsection, we review the previous LFDR estimators that lay the foundations on 
which our new estimators are constructed. 



3.1.1 LFDR estimates based on other false discovery rates 



Recall from Section [T] that the ith null hypothesis is rejected if the statistic t{ falls within 
some rejection region T. To avoid the specification of such a rejection region T, an estimated 
q-value q (pi) is commonly calculated for the ith p-value pi among the N p-values. The 
rejection region T a is a function of the significance level a, the usual Type I error rate of 
rejecting the ith null hypothesis if and only if pi < a; thus, the estimated q-values, herein 



called q-values to follow contemporary terminology (Hong et al. 2009), are given by 



q (p^ = min pFDR (%) ; [i = 1, . . . , N] 

<*e[pi,i] 



where pFDR(7^J is an estimate of the positive FDR (pFDR) on the rejection region T Pi 



(Storey, 2002). Thus, the q-value is the lowest estimated pFDR at which the zth null 



hypothesis is rejected. Because the latter effectively uses 1 as an estimate of 7To, it will be 
called QV1 in order to distinguish it from q(pi), which is called QV. 

In addition, conservative LFDR estimators based on the binomial distributions have been 



proposed by Bickel (201 Id). The estimator that Bickel (20 lid) called the "MLE" is renamed 



in this paper to avoid confusion with the estimator addressed in the next subsection. We 



denote the version that uses the estimate of ttq described in Storey (2002) as the binomial 
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based estimator (BBE) to distinguish it from BBE1, which instead uses 1 as an estimate of 

7T . 



3.1.2 Maximum likelihood estimator 

The maximum likelihood estimator (MLE) described in this subsection will be called the 



leave- zero- out (LOO) method for reasons given in Section 6.1 The LFDR is estimated 
under the assumption that both the null-hypothesis density function gg and the alternative- 
hypothesis density function g a \ t of equations ([2j)-((3]) are members of {gg : 9 E 0}, a paramet- 
ric family of probability density functions indexed by the interest parameter 9, which is a 
member of some parameter space 0. Thus, g a \ t = <?0 alt , where a i t G is unknown and not 
equal to the known 9$ € 0. Any nuisance parameter must be eliminated, perhaps by using 
one of the two methods explained in Section [2] 

For the ith feature, the data vector Xi is reduced to a scalar statistic ti, as in Examples 
[TJJ3J Therefore, gg (ti) and gg M (U) denote the probability densities for the reduced data 
under the null hypothesis and the alternative hypothesis, respectively. The true value of the 
LFDR for the ith feature is, according to equations ((2|-([3]) and g a \ t = gg a]t , 

^ = Mgg ( 9 ) 

n geo (*<) + (1 - *"o) g 0alt (U) ' 

which is unknown since 9 a \ t and 7r are unknown. 

The LOO method involves the estimation of the parameters ttq and 9 a \ t . These estimated 

parameters (^9 L0 °, tTq 00 ^ are the maximum likelihood estimates of the true parameters given 

by 

JV 

(fl L0 °, 7r L0O \ = arg sup FT (n o g 0Q (U) + (1 - tt ) g e (U)) . (10) 
x ' (e,7r >eex[o,i] ~~[ 

Therefore, with substitution into equation the estimated LFDR for the zth feature is 

?loo ^o°°9e {ti) . v 

Wi ' ^°gg (U) + (l-^ 0O )9g^o(Uy 1 } 
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This estimator has been used with marginal likelihood (Yang and Bickel 
and conditional likelihood (Yang et al. , 2011). Similarly, Muralidharan ( 
the LFDR by maximizing the likelihood over exponential families. 



. [2010tpckell|2011e[ ) 
(2010) had estimated 



3.2 New LFDR estimators 

Here, 5 novel LFDR estimators are proposed: 3 are corrected MLEs, and the other 2 are 
related to the BBE. The corrected MLEs are based on equation ([2]). The fourth technique is 
an approximation of the BBE, and the last new estimator is a combination of the BBE and 
one of the corrected MLEs. 



3.2.1 Corrected MLEs 

The three methods presented here correct the bias of the LOO method that results from 
using the same statistic to evaluate the density functions and to estimate 7r and 9 a \ t . 
This is accomplished by removing dependence of the estimators on t{ prior to evaluating the 
density functions at tj. While that negative bias vanishes as the number of features increases 
(Appendix B), it can be unacceptably large for small numbers of features. 

The first corrected MLE is called the minimum description length (MDL) method. Al- 
though the method was inspired by the MDL principle (Appendix A), the general idea of 
estimating a prior on the basis of exchangeable features other than the feature under consid- 



eration is implicit in Goodman (2004); cf. Gastpar et al. (2010) and J. Cuzick's discussion 



of Aitkin (1991). The MDL method uses modified estimates of parameters tc and 9 a \ t for 
the ith feature, denoted as ^#f DL , ^ DL ^ for i £ {1, . . . ,N}. These estimated parameters 
are obtained by maximizing the likelihood function: 



A' 



^mdl^mdiA arg sup Yl (n g do (U) + (l-n )g e (U)). (12) 

' <0,7ro}eex[o,i] i=Wi 



Note that the product is obtained over all features except for the iih feature. Accordingly, 
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the MDL estimator of LFDR for the iih feature is given by 



? mdl = n oi yvo /io\ 

K oi 9e [H) + [i - n 0i ) 9jMDL [ti) 

The second corrected MLE estimator, called leave-one-out (L10), is the same as the MDL 
except that the LOO estimate of tt is used instead of vr^ DL - Therefore, in LIO, three steps 
are involved. First, the parameters ^ L0O ,7Tq 00 ^ are calculated from the likelihood function 



MDL, 



(10) involved in the LOO method, which includes all the features. Second, similar to MDL, 
the likelihood function involving all features, except for the ith feature, is maximized for 
every feature using the 7T"o°° obtained in the previous step. Therefore, in this step, the 
interest parameter for all i 6 {1, ... , N} is estimated as 

N 

^ L1 ° = argsup ] I (ir L0 % (t J ) + (l-7r L0O )^(t J )), (14) 
leading to the LIO estimator of LFDR for the zth feature: 



^-L0O„ 

% " fidget (U) + (1 - 4°°) {U) ' 1 ' 

The MDL and LIO strategies eliminate bias from a double use of feature data. However, 
when there is only a single affected feature, the MDL and LIO do not use any information 
about 6 ait to estimate the LFDR of the only affected feature, introducing considerable bias 
in estimating # a i t . 

To overcome this defect, we introduce the third corrected MLE, called the leave-half-out 
(Ly20) estimator. Like LIO, Ly20 includes information about the ith feature through 7Tq 00 ; 
furthermore, half of the information about each left-out feature is also included in the Ly20 
through the likelihood function. Such a function is a weighted likelihood function, where the 
contribution of the ith feature to the overall likelihood function is corrected by a weight Wij 



13 



given by 



1 



Wu v 



I/ + (jV_l)' u, *iW I/ + (j V _ 1 ) 
where z/ G [0, 1] is the information (log-likelihood) weight of ti relative to each t,- for the 



[J + A , 



N 



purpose of estimating the parameter of interest. Thus, the weights satisfy Yl w ij = 1- The 

i=i 

^-weighted likelihood function for feature i is 



JV 



0^; ti, i/) = J] Mft, fo) + (1 - *o) <?e alt fe)) 



and the maximum ^-weighted likelihood is 



(16) 



argsupLi(7r , 9; t h u), 



(17) 



degenerating to the LOO and LIO estimators when v — 1 and z/ = 0, respectively. Thus, 

9\ may be considered the leave-v-out (Lz/O) estimator. 

The proposed U-foO method includes exactly half of the information about the ith feature 

in its likelihood function by setting u = 1/2. Therefore, the new estimator fff 1 for all 

i G {1, . . . , iV} is given by the maximization of the weighted likelihood function according 

}Li/ 2 o^Loo\ the LFDR for the 



to equations ( 16 )-( 17). With such estimated parameters (9 



ith feature (V^ 20 ) can be estimated with equation (15), replacing Of 10 with #^ i/2U , and 



3L1/2O 



analogously for ip\ u ° given any v between and 1. 



Weighted likelihoods have been reviewed by Hu and Zidek (2002) and applied to the 



quantification of evidence by Bickel (2011b). 



3.2.2 BBE-related LFDR estimators 



A method for approximating the BBE (Bickel, 2011d) is also presented here. BBE attempts 



to estimate the LFDR more conservatively than q-values, which were not originally designed 
for LFDR estimation. In this section, we denote the q-values as q, which refers to either QV 
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or QV1 (see Section [3J), and p t denotes the rank of the q- values corresponding to the ith 
feature, such that qm < q<2) < ••• < Q(N)- The new proposed method directly assigns twice 
the rank of the q- value q(2 Pi ) to the LFDR estimate of the ith feature with the corresponding 
q-value q( Pi )- Therefore, we define (estimated) r-values as 



r(qi 



?( 2 «) ifp,<iV/2 
1 i{ Pl >N/2. 



We employ analogous notation for r-values, i.e., RV when it uses QV and RV1 when it uses 
QV1. Our aim is to verify that RV and RV1 approximate BBE and BBE1, respectively. 
Finally, for reasons given in Section 6.1[ we combine BBE and MDL into an estimator 



that leverages the strengths of each. Specifically, the MDL-BBE is the linearly combination 



of the other two estimators with weights that are optimal for the hedging game of Bickel 

poTTc). 



4 Application 

In Alex Miron's laboratory at the Dana-Farber Cancer Institute, the abundance levels of 20 
plasma proteins of 55 women with HER2-positive breast cancer, 35 women with ER/PR- 



positive breast cancer, and 64 healthy women (Li, 2009) were measured. The respective 
data vectors ER2 , . . . , x™ R2 , xf R,/PR , . . . , x^jf ,/PR , yi, ■ ■ ■ , 2/20 were created by adding the 
first quartile of the abundance levels (over the 64 healthy women and over all proteins) 
to each abundance level and by taking natural logarithms of the resulting sums; similar 



conservative prepossessing steps have worked well with gene expression data (Bickel, 2002). 

The preprocessed data were modeled as normally distributed, as illustrated in Example 
3} Following the notation of the example, £ RER2 , £f R//PR , and 77, are the expectation values 
of X 2 HER2 , xf R ^ FR , and Yi, respectively, and are as such interpretable as population levels 
of the abundance of protein i. The parameters of interest are #her,2 _ |^ HER2 _ ^| j (Ji anc l 
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e 



ER/PR 



.ER/PR _ 
Si 'It 



/<Ji, the standardized levels of the ith protein's abundance relative 
to the healthy controls. In this context, the LFDR of each protein is a posterior probability 
that its average abundance level is not affected by cancer. 

The data were analyzed according to the distributions of T (X 4 HER2 , Yj) and T ^Xf R,/PR , Y{ 
given in Example [3} The methods of estimating the LFDR described in Section 3.1 



were 



applied to the proteomics data, namely, MDL, LOO, LIO, L1/2O, BBE, BBE1, RV, RV1, 
and MDL-BBE. The results are shown in Figures [T] and [2j which represent LFDR versus 
the estimated protein abundance ratio and p-value, respectively. All figures show results for 
the HER2-positive and ER/ PR-positive groups separately. The volcano plot (Figure [TJ indi- 
cates that the proteins most affected by cancer, showing estimated abundance ratios furthest 
from unity, have LFDR estimates close to zero, while higher LFDR estimates correspond to 
proteins with estimated abundance ratios close to unity. From the results shown in both 
figures, we can see that the selection of the LFDR estimator is crucial because for thresholds 
of the estimated LFDR between and 0.2, many proteins would be considered affected or 
unaffected by cancer, depending on the method. BBE1 and RV1 were omitted from the 
figures to ensure legibility. 



5 Simulations 



In this section, the performance of the LFDR estimators described in Section 3.1 is compared 
using simulated protein abundance data. Such methods are MDL, LOO, LIO, Ly20, BBE, 
BBE1, RV, RV1, and a combination of MDL and BBE. The design of each data set is 



patterned after that of Sections |3. 1.2 and |4} It consists of abundance levels of N proteins 



for two groups, sick and healthy, each containing 5 individuals, for total of 10 abundance 
levels per protein. For the ith protein, the log-abundance data are drawn from a normal 
distribution with variance a 2 = 1 and mean equal to 0, except for the proteins affected by 
the disease, which have mean £ a i t > in the sick group. To represent both barely detectable 
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Figure 1: Volcano plot representing LFDR for protein abundance of both groups, HER2- 
positive and ER/PR-positive women, relative to healthy women, estimated by using different 
LFDR estimators and represented versus the estimated protein abundance ratio. The LFDR 
estimators are MDL, LOO, LIO, V/2O, BBE, RV, and MDL-BBE. 




Figure 2: LFDR for protein abundance of both groups of women with breast cancer, HER2- 
positive and ER/PR-positive, relative to healthy women, estimated by using different esti- 
mators and represented versus p-value. The LFDR estimators are MDL, LOO, LIO, L x /20, 
BBE, RV, and MDL-BBE. 
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and highly detectable differences between the null and alternative distributions, we consider 
two values for the effect size, a low value (£ a it = 1.5) and a high value (£ a it = 4) relative to 
the standard deviation (er = 1). Therefore, we have two values for the positive noncentrality 
parameter 5 a i t = {m~ l + n^ 1 ) 9^ t , where, in agreement with equation ([7]), 

0alt = |U-O|/<7 = U, (19) 

and m and n are the numbers of individuals in the sick and healthy group, respectively 
(m = n — 5). Therefore, the distribution of the affected proteins in the sick group has 
Jait = 2.4 in the low-effect simulations and 5 a it = 6.3 in the high-effect simulations. By 
contrast, the noncentrality parameter values are for all the unaffected proteins and for all 
the proteins of the healthy group. Then, the LFDR estimators are compared with regard to 
the number of proteins in each data set and the number of affected proteins for 20 simulated 
data sets of each configuration. 

To facilitate the comparison among the different LFDR estimators and for specific ver- 
ification of the similarities between BBE and RV and BBE1 and RV1, we estimated each 
estimator's bias, the mean (over all proteins) of the expectation value of the difference be- 
tween the estimate and true LFDR. For each LFDR estimator, that bias is estimated by the 
mean difference between the estimated LFDR and the true LFDR, where the mean is over 
the simulations as well as the proteins. Thus, 60 LFDR estimates are averaged when the 
data set has 3 proteins (mean over 3 proteins and over 20 simulations) and 300 LFDR values 
when the data set has 15 proteins (mean over 15 proteins and over 20 simulations). The true 
value is calculated using equation (|9| with the proportion of proteins that are unaffected as 



7To and with the value of 9 a \ t given by equation (19). 

The results are shown in Figure |3j where the estimated bias of the LFDR is represented 
as a function of the number of affected proteins, for each number of proteins in the data set 
and for two different levels of detectability. Although we studied the behavior of the methods 
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separately for affected and unaffected proteins, the figures show the estimated biases of the 
LFDR averaged over all proteins. Figure |3j plots (a) and (b), show the results for a data set 
with 3 and 15 proteins, respectively, and for the high level of detectability. Plots (c) and (d) 
are the same except that they correspond to the low detectability level. For better legibility 
of the figures, RV1 and BBE1 are not displayed because they have excessively high estimated 
bias averaged over either affected or unaffected proteins or averaged over all proteins. 

It can be seen from Figure [3] that the LFDR estimates depend on the number of pro- 
teins, the number of affected proteins, and the detectability level. Note that in the plots, 
the contribution of the bias from the affected proteins increases as the number of affected 
proteins increases because the protein-averaged results are, in effect, weighted according to 
the number of affected or unaffected proteins. The estimators BBE1 and RV1 are not dis- 
played because the values of their (positive) biases are much higher than those of the other 
estimators. However, the biases of RV and BBE are more moderate, especially when few 
proteins are affected. 

6 Discussion 

6.1 Evaluation of the LFDR estimators 

Some differences in estimator performance depend on the value of <5 a i t , the noncentrality 
parameter. LOO and Ly20 work very well when S a \t is high, regardless of the number of 
features in the data set (Figure [3j (a)-(b)) and when there is at least one affected feature. 
When there is no affected feature, both estimators have highly negative bias (about —0.25). 
When dgit is high, MDL and LIO perform similarly to LOO and Ly20, except when there 
is only one affected feature in the data set, in which case MDL and LIO have excessively 
high positive biases for the affected feature. Those biases are not seen in the plots since they 
are averaged over all features. These biases result from the fact that MDL and LIO do not 
use the data of the given feature to estimate S a \ t . Thus, MDL and LIO cannot effectively 
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estimate <5 a i t when only one feature is affected, which results in such a noticeable high positive 
bias when £ a i t is high. L 1 /20 overcomes that drawback by including half the information on 



the unique affected feature in its likelihood function (16). In contrast, BBE and RV are less 
biased than the other estimators when no features are affected. However, the values of their 
conservative (positive) biases increase with the number of affected features. On the other 
hand, when S a \ t is low (Figure [3j (c)-(d)), all the corrected MLEs are negatively biased when 
there is no affected feature, and BBE and RV have positive biases. 

In addition, by comparing the four plots in Figure |3j we can see that BBE and RV are 
extremely similar; only slight differences appear in cases of few affected proteins. Moreover, 
the methods gave similar estimates in the application to real protein data (Figures [2] and 

00 

Therefore, since BBE-related estimators show a small bias for no or a few affected features 
and since corrected MLEs perform better when most of the features of the data set are 
affected, we consider a new LFDR estimator as the weighted combination of representative 
estimators of each type (corrected MLEs and BBE-related estimators): the MDL and the 
BBE. Based on performance with 3 features and low <5 a i t , we choose to combine MDL and 
BBE because MDL has the lowest absolute value of the bias among the corrected MLEs and 
because the BBE is simpler than the RV but is similar in performance. Then we applied the 
same MDL-BBE method to all cases. The MDL-BBE is an optimal linear combination of 



the MDL and the BBE (Section 3.2.2) 



To summarize the findings for each method and each total number of features in the data 
set, Table [T] reports the most extreme values and the median of the biases for 7r > 90% over 
the numbers of affected features and over both values of S a it. We can see from this table 
that these values are very similar among the methods of the same type. Corrected MLEs 
have the most negative biases, and BBE-related methods have the highest positive biases. 
As Table [T] indicates, the MDL-BBE succeeds in substantially reducing the negativity of the 



however, we found in unpublished work that these estimators diverge more for an application to a 
large-scale proteomics data set. 
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LFDR 

Estimators 


all tt 


7T >90% 


3 features 


15 features 


3 features 


15 features 


MDL-BBE 


0.13 [ 0.10, 0.41] 


0.01 [ 0.20, 0.31] 


0.1 


-0.13 [-0.2, 0.01] 


BBE 

RV 


0.19 [ 0.07, 0.69] 

0.18 [-0.08, 0.69] 


0.01 [ 0.11, 0.55] 

0.00 [-0.13, 0.55] 


0.07 

-0.08 


-0.07 [ 0.11, 0.01] 

-0.09 [-0.13,-0.02] 


MDL 

LOO 
LIO 

Ll/20 


0.02 [ 0.13, 0.12] 

-0.02 [-0.22, 0.16] 
0.02 [-0.17, 0.20] 
-0.02 [-0.22, 0.18] 


0.00 [ 0.30, 0.07] 

-0.01 [-0.34, 0.08] 
0.00 [-0.30, 0.08] 
0.00 [-0.32, 0.08] 


0.13 

-0.18 
-0.17 
-0.17 


0.19 [ 0.30, 0.02] 

-0.17 [-0.26,-0.01] 
-0.16 [-0.24, 0.03] 
-0.17 [-0.24,-0.01] 



Table 1: Median [minimum, maximum] values of the biases of all the LFDR estimators over 
all 7To and over all ttq > 90 %, over the numbers of affected features, and over both values of 
the noncentrality parameter. Separate values are given for each total number of features in 
the data set. 

worst-case bias of the MDL and substantially reducing the highly conservative worst-case 
bias of the BBE. In short, the MDL-BBE does not suffer from the main drawbacks of the 
other estimators. 

Since the focus on the worst-case performance can lead to an overly pessimistic assessment 
of small-scale estimation of the LFDR, the median values are also reported in Table [T] They 
indicate that while estimation is somewhat unreliable for some estimators when there are 
only 3 features, it is reliable for all estimators when there are 15 features. Even so, the 
reported absolute values of the biases should be regarded as lower bounds since they were 
computed under the independence of features. Further, since the simulations use the same 
family of distributions as the MLE-related estimators, they perform better in the simulations 
that they would with real data. 

6.2 Conclusions 

In this paper, we proposed several LFDR estimators to give reliable results for small-scale 
inference. We compared them on simulated data sets and illustrated their use on a protein- 
abundance data set that illustrates that different conclusions would be drawn on the basis of 
different estimators. The performance of such methods depends on the number of features, 
number of affected features, and values of the unknown parameters. Simulations showed that 
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Figure 3: Estimated bias of several LFDR estimators for an artificial data set with 3 features 
((a) and (c)) and 15 features ((b) and (d)) and cases of high £ a i t = 6.3 ((a) and (b)) and 
low noncentrality parameter 5 a \t = 2.4 ((c) and (d)) versus increasing number of affected 
features. The LFDR estimators are MDL, LOO, L10, L 1 /20, BBE, RV, and a combination 
of MDL and BBE. 
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the corrected MLEs have very low biases in all cases when more than 50% of the features 
in the data set are affected, even for a data set with only 3 features. However, when the 
proportion of affected features is very small, these methods have excessively negative biases. 
In contrast, BBE and RV have excessively large biases when there is a high proportion 
of affected features. Furthermore, this bias increases as the number of affected features 
increases in the data set. Therefore, the weighted combination of an adjusted-MLE (MDL) 
and a conservative method (RV or BBE) may represent the safest solution for a general 
scenario in which the number of affected features is unknown. 



Colophon 



We used the following packages of R (R Development Core Team, 2008): Biobase (Gentle- 



man et al., 2004) and qvalue (Dabney et al. , 2011 ) from Bioconductor; locf dr (Efron, 2007) 



fBasics (Wuertz, 2010), and distr (Ruckdeschel et al. 2006) from the CRAN repository. 



Appendix A: Methods motivating the new MDL method 

This appendix uses the MDL principle to explain the statistical methods that led to the MDL 



method defined by equation (13). This appendix also has results that lay the foundation 
for the operating characteristics of the estimator given in Appendix B. A simple explanation 
of basic MDL-theoretic ideas in terms of hypothesis testing is available in the appendix of 



Bickel 


( 


2011b 


). See 


Rissanen 


( 


2007 


>, 


Barron et al. 


( 


1998 


), 


Griinwald 


( 


2007 


), and 


Bickel 



(2011a) for other introductions to the MDL principle of model selection. 

Since 9 a \ t is unknown, it will be replaced with a parameter value chosen to minimize the 
codelength of the data according to MDL theory, in which the length of a codeword is the 
number of independently selected binary digits of equal probability that achieve the joint 



probability of that codeword (Rissanen, 2007). The availability of measurements pertaining 



to features other than the inference target enables the construction of a universal codelength 
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function and a close approximation that is computationally more convenient. The idea that 
statistical inference minimizes universal codelength functions is called the MDL principle 
and is often formalized in terms of minimax problems. 



Minimum description length concepts 

The theory of this section is presented in terms of a parametric family that is free from 
nuisance parameters. In many cases, such a family can be derived using one of the data 
reduction methods of Section |2] 

Under the MDL framework, each scheme f for coding the data under the alternative 
hypothesis corresponds to a codelength function on X and thus to a compressing prob- 
ability density function g< selected from the parametric family {gg : 9 G 0} before observ- 
ing T(x), the realized value of the statistic, with the goal of minimizing the codelength 
L' (T (x)) = — log g* (T (x)). Since 8q is known, the probability density function of the statis- 
tic under the null hypothesis is known to be gg , which compresses the data with respect to 
the null model. Accordingly, the codelength function L° relative to the null hypothesis is 
that specified by L° (T (x)) — — loggg (T (x)). Since the base of the logarithm is arbitrary, 
the inverse logarithm is denoted by log -1 • rather than by exp • or by 2V 

Suppose, as in Example |3j that there is a vector x, of measurements for each of the iV 
features and that the data are reduced to the statistics T (xi) , . . . , T (x^). With L\ (T (x^)) 
as the codelength of T (xj) relative to the alternative hypothesis, A\ (T (xj)) = L,\ (T (x^)) — 
L° (T(xj)) is the information in T (xj) for discrimination favoring the null hypothesis over 
the alternative hypothesis; cf. Bickel ( 2011b[a ). A difference in null and alternative code- 



lengths has been called a "universal test statistic" (Rissanen, 1987); however, that term can 
cause confusion with T (Aj). 

Example 4. If the restriction to a parametric family were relaxed, 

- log ; : = - log + log — ; (20) 
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would be the information for discrimination according to the empirical Bayes methodology 
of Section Q] 



The regret (iGriinwaldl 20071) of the codelength function L\ is 



reg (gl,Xi) = L\ (T(xi)) - inf (-log^ (T (x t ))) = - log 



see 



9e ( T 0*))' 



where L[ is given by L\ (T (xi)) = —\ogg\{T(xi)) and where 6 = argsup 0g0 g$ (T (x)). 
Likewise, the regret of the codelength function relative to the null hypothesis is reg (go i x i) = 
-log (g 0o (T (Xi))/g § (T (x<))). 



While the sign of A\ (T (xi)) indicates which hypothesis is favored (Rissanen, 1987), it 
can also be compared to a threshold J of the minimum amount of information considered 
sufficient for selecting one hypothesis over the other. In that case, the probability of observing 
misleading information for discrimination has an upper bound for any distributions gg and 
g\. Specifically, for any J > 0, 

Pe Q , x (Aj (T (X,)) < - j) = P eo , x (g\ (T (X,)) / g do (T (X,)) > log" 1 j) < 1/ log^ 1 J. (21) 



Applications to the probability of observing misleading evidence appear in Royall (2000). 



A derivation from the Markov inequality appears in Bickel (2012). Since the derivation 
assumes that gg Q and g\ are genuine probability density functions, formula (21) does not 



necessarily hold for pseudo-likelihoods such as profile likelihoods and likelihoods integrated 
with respect to an improper prior; however, it does hold for all marginal and conditional 



likelihoods (Royall, 2000) 



The following two schemes (f and f ) for coding the reduced data give essentially identical 
regrets for a sufficiently large value of N. 
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Exact codelength 

While the codelength function L\ for the ith feature cannot depend on Xj, it may depend 
on Xj for all j ^ i as follows. For all i = 1, . . . , N, define L\ such that the corresponding 
probability density function g\ is equal to g e \ for the value 6\ such that 

i 

e\ = arg inf V min (reg (g g , xj) , reg (ge ,Xj)) ■ (22) 

In words, the code for a given feature uses the distribution in the parametric family that 
minimizes the regret summed over all other features. 

Proportional to N 2 , the computation time can prohibit the use of the universal compres- 
sion method for large N. For example, iV can be in the tens of thousands for gene expression 
microarrays or in the hundreds of thousands for genome-wide association studies. The next 
coding scheme overcomes this problem because its computation time is proportional to N. 

Approximate codelength 

The f coding scheme is efficiently approximated by a slightly illegal scheme denoted by $. 
It determines the codelength for statistic T (xi) under the alternative hypothesis by using 
a common probability density function g* that is in the parametric family, i.e., g* = g Q \ for 
some ^6 0. This is accomplished by minimizing the regret over all features 

N 

9 t = arg inf V min (reg (g e , xj) , reg (ge ,Xj)) . (23) 

3=1 

This coding scheme is technically illegal in the sense that g*, as a function of the observed data 
for each feature, depends on hindsight. However, under general conditions, 9^ approximates 
e\ for alH = 1, . . . , N given sufficiently large N because the selection of the distribution 
depends on all features without giving undue weight to any single feature. The approximation 
is supported by the fact that both 9^ and 0* are maximum likelihood estimates of 9 under 
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the alternative hypothesis: 

Theorem 1. Assume that for some ^£6 and 9 a u G such that 8 ^ 9 a u and that for all 
j G {!,..., N}, each statistic T (Xj) has probability density g$. with 9j G {0o,9 a lt} an d ^ s 
independent of every T (X k ) with k G {1, . . . , N} \ {j}. It follows that 0*, if unique, is the 
maximum likelihood estimate of 9 a i t . 



Proof. Using equation (23) 



N 



6 X = arginf y~]min (— log go (T (xj)) , — logg do (T (xj))) 

9 

N 

= arg sup max (log ^(T^)), logg 6o (T (xj))) 



eee 3=1 



N 



arg sup sup V] lo S 9e s (T (xj)) 
eee ee{e ,9 M } N j= i 



N 



= argsup sup TTp fl (T (xj)) , 
960 oe{e ,e alt } N fJi 

where 6 = (9\, . . . , On) and {9o, ^ait}^ is the iV-factor Cartesian product {9o, #ait} x • • • x 

{#0,#alt}- □ 

Corollary 1. Under the assumptions of Theorem^ i G {1, . . . , N}, if 9[ is unique, then 
it is the maximum likelihood estimate of 9 a u on the basis of the outcomes Xj = Xj for all 

je{l,...,N}\{i}. 

Proof. The claim reduces to that of Theorem [T] because the data are equivalent except 
for the presence or absence of the outcome T (Xj) = T (x j) and because 9\ and 9 X are 
equivalent, except for the presence or absence of the term involving that outcome. Thus, for 
all i G {1, . . . , N}, 

6\ = arg sup sup g g . (T (xj)) . 

eee ee{9 Q ,e alt } N j+i 

□ 
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Theorem [3] of the next subsection specifies sufficient conditions for the convergence of 
9* — d\ to as N increases. 

The coding method of the section entitled "Exact codelength" is universal in the sense 
that it asymptotically compresses the data as much as the noiseless coding theorem allows 



for any distribution in the parametric family (cf. |Rissanen (2007, §3.7) and Griinwald (2007 
§6.5)). Sufficient conditions for universality are stated in the following lemma, in which 
strong consistency means almost sure convergence to a parameter value as n — > oo if each 
T (Xi) is stationary and, at fixed n, of a density function in {gg : 9 G 0}. (The dependence 
of go on n is suppressed.) Such convergence will be denoted by A. 

Lemma 1 (Consistency). Suppose that for some 9q G and 9 a \ t G such that 9q ^ 9 & \ t and 
that for all j G {1, . . . , iV} ; each statistic T (Xj) has probability density gg. with 9j G {9q, $ait} 
such that 9j = 9 a \ t for at least two values of j in {1, . . . , N}. Suppose further that g, (T (Xj)) 
is almost surely continuous on for all j G {1, . . . , N}. If, for some i G {!,..., N}, 
9\ is unique and 9j = argsup eg6 g$ (T (Xj)) is a strongly consistent estimate of 9j for all 
j G {1, . . . , iV} \ {i}, then 9\ is a strongly consistent estimate of 9 & \ t . 

Proof. Let Z = {j '■ 9j = # a it, j G {1, . . . , iV} \ {i}}, which by assumption is nonempty. By 
the consistency condition, 9j A 6^ for all j G 3 and 9j A 9 for all j G {1, . . . , iV} \$. Thus, 
with probability 1, 

Hg ej (T(Xj)) = Y[g e , t (T(Xj)) J] gg (T (Xj)) 
= l[g §] (T(X J )) 

= l[m^(g 9alt (T(Xj)),g eo (T(Xj))) 

= sup TT max (g e (T (Xj)) , g do (T (Xj))) 
eee 

in the limit as n — > oo, with the equalities holding by the almost-sure continuity of g, (T (Xj)) 
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as a function on (Serfling, 1980, §1.7). Because by equation (22), 



9\ = arg sup max (9o (T (X d )) , g 6o (T (Xj))) , 



3^1 



it follows that d\ 4 0<. □ 

Heuristically, the key observation of the proof is that whether 9 is constrained to have 
one of the two values has no asymptotic effect on the estimates of 9j. The universality of 
the codelength function is a consequence. 

Theorem 2 (Universality). Under the conditions of Lemma^ 

]haE ( ( te))) = , imB , /-iog^,(r W )) 



/or a// 2 G {1, ...,AT} snc/i i/iai = # a i t , where Eg alt signifies the expectation value with 
respect to g 6ait , i.e., Eg M (•) = / •dP 0M . 

Proof. Pg M ^limn^oo 6\ G {9 , 9^}) — 1 for all i G {1, . . . , N} because 9\ A 6>j by the lemma 
and 9i G {6> , a it} by assumption. Hence, 9\ A # a it for all i G {1, . . . , A^} such that 9i = 9 a \ t . 
Thus, for those values of i, 

/-log (g 6M (T(X«))/^t 
hm £ 0alt ^ '- I = 

n->oo alt I n 

because ge^iT (Xj)) /g s t (T (Xj)) A 1 by the almost-sure continuity of g,(T(Xj)) as a 



function on (Serfling, 1980, §1.7). □ 



The iV — > oo universally of a related mixture code will be established in Appendix B. 
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Asymptotic characteristics of 0* and 9\ 

Assume X 1 ,X 2 , . . . are independent and each of identical distribution P*. For example, P + 
could be a A'-component mixture distribution P* = Y^k=i ^kP-kk-, where tt^ is the probability 
that some Xj has distribution P^, which is not necessarily in {Pe t \ '■ G 0, A G A}. Let 
E± (•) and A denote the expectation value and almost-sure convergence as iV — > oo with 
respect to P+. 

Theorem 3. Suppose that, for alH G {1, . . . , N}, E* (\ogg e (T (Xj))) < oo for all G 
and that 0* and 9\ are unique with P^-probability 1. Then d x -0\ 4 for alii G {1, . . . ,N}. 

Proof. For any G 0, let 0,(0) = arg max^ ^ ^(T (xj)). Because log g §j {e) (T (Xj ) ) 
is IID for all j G {1, . . . ,N}, the strong law of large numbers implies that, for all G 
{{l,...,iV},{l,...,iV}\{l},. ..,{!,. ..,N}\{N}}, 

jj- J2 ^gg §m (T (Xj)) ^ E* (\ogg ¥e) (T (Xj))) 

= Pi, (§j (0) = O ) E* (\ogg §m (T (Xj)) |0, (0) = O ) 

+P* (§j (0) = d) E* (\ogg §m (T (Xj)) |0, (6) = e) , 

the fmiteness of which follows from that of E+ (\oggg (T (Xj))). As the result holds for 
arbitrary G 0, 

arg sup yj—r lo g#e» (T (Xj)) 4 
see \Jn\ rri n ' 

3&JN 

arg sup E„ (log^ w (T (X,))) 

irrespective of whether the sum on the left-hand-side is over {1, . . . , N} or over {1, . . . , N} \ {i} 
for some % G {1, . . . , N}. (The uniqueness of the maximizing value of on the left-hand-side 
is guaranteed by the postulated uniqueness of 0* and 9\.) Therefore, the difference in the 
maximum likelihood estimate of under the alternative hypothesis using Xi, . . . ,X N and 
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that using Xi, . . . , Xj_i, X i+1 , . . . , X^v converges almost surely to 0; however, such maximum 
likelihood estimates are and 9\, respectively, according to Theorem [l] and Corollary [IJ □ 

Appendix B: Asymptotic characteristics of MDL and LOO 

This section extends the fixed-component results of Appendix A to the two-component mix- 
ture density of equation ^ with the constraint that g a i t = gg aH for some a i t G 0. In this 
setting, the universal density g\ and its approximation are replaced with g 4 MDL = ^mdl 



and its approximation g L0 ° = g 6 Loo, where (^ DL ,7Toi DL ) are given by equation (12). (Yang 



and Bickel (2010) compared the performance of g$ and g L0 ° by simulation.) 

Assuming the statistics are independent, (6 I J MDL , vr^ DL ) and (# L0 °, vTq 00 ) are clearly max- 
imum likelihood estimates of (9 a \ t , 7r ). Consequently, the steps used to prove Theorem [3] also 



demonstrate that 9 h0 ° - Of DL A and - tt^ dl A for all i G {1, . . . , N} under the 
independence condition. The mixture codes form LFDR estimates via substituting either 
6 MDL and vr^ DL or 6 L0 ° and vr^ 00 into equations @ and Q. 

Whereas regularity conditions entailing the strong consistency of maximum likelihood 



estimates for finite-mixture models (Redner and Walker, 1984) would apply as N — > oo 



seemingly more pertinent to universality is consistency in the sense of A, which is almost- 
sure convergence as n — > oo under the stationarity of every T(Xj). However, such A 
consistency does not hold if N is finite and if 7r > 0, for in that case, there is fixed, nonzero 
probability tTq that all N statistics have probability density function gg rather than go &lt - 
Therefore, consistency will be used instead. 

Theorem 4. If the maximum likelihood estimate 8 h0 ° almost surely converges to 6 a u as 
N — > oo and if g, (T (Xj)) is almost surely continuous on © for all i G {1, 2, . . . } ; then 



lim E„ M (L* (T (X)) /n) = E Bm (- log ^ (T (X)) /n) 

N—>co 



for all i G {1, 2, ... } such that Q; L = & \ t , where L* (T (Xj)) = — log (^mdl (T (X.,-)) and Eg : 
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signifies the expectation value with respect to ge alt , i-G., E$ alt (•) = J •dPg^. 

Proof. Since 6^ DL is the maximum likelihood estimate for the N — 1 statistics other than 
T (Xt), fl, MDL 4 6 alt . Thus, the claim follows from reasoning analogous to that used to prove 
Theorem |2l □ 

Corollary 2 (Asymptotic universality). Given the conditions of Theorem^ 

L i ( T ( X i)Y 



n 



logger {T{X 3 )) 



n 



lim lim Eg a 

n— s>oo N— >oo 

= lim E e M 

n— ¥oo 

for all i G {1,2,...} such that 9{ = 9 a i t . 

The proof is trivial. The corollary means that 

(LJ (T (*,)) - log (1 - < DL )) - (L° (T (*«)) - log*™) 

may be regarded as approaching the information for discrimination under the mixture model 
as N ->■ oo. Since 8 L0 ° - ^ DL A and tTq 00 - 7r^ DL 4 0, that information is approxi- 
mated by substituting the maximum likelihood estimates 8 L0 ° and vTq 00 for 6 l l MDL and 7r^ DL , 
respectively. 
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